Twitter is a gold mine of data. Unlike most other social platforms, almost every user's tweets are public and can be pulled through the API. This is a huge plus if you need a large amount of data to run analytics on. Twitter data is also pretty specific: Twitter's API lets you run complex queries, such as pulling every tweet about a certain topic from the last twenty minutes, or pulling a certain user's non-retweeted tweets.
The main purpose of this task is to take the official Twitter accounts of 5 Bangladeshi cricket players and collect some information from them, then analyse the data and run a comparative analysis of their social media accounts.
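As a rough illustration of the kind of query described above, the sketch below composes a search string from the standard search operators `from:` and `-filter:retweets`. The helper name `build_search_query` and its parameters are hypothetical and not part of Twython or of this notebook. Note that searching for a bare handle such as `@Mustafiz90` matches tweets that contain that text (mostly mentions by other users), whereas `from:` restricts results to tweets posted by the account itself.

```python
def build_search_query(handle=None, topic=None, include_retweets=True):
    """Compose a query string for Twitter's standard search (hypothetical helper)."""
    parts = []
    if topic:
        parts.append(topic)
    if handle:
        # "from:" restricts results to tweets posted by that account
        parts.append("from:" + handle.lstrip("@"))
    if not include_retweets:
        # "-filter:retweets" excludes retweets from the results
        parts.append("-filter:retweets")
    return " ".join(parts)


print(build_search_query(handle="@Mustafiz90", include_retweets=False))
print(build_search_query(topic="#cricket"))
```

The resulting string could be passed as the `q` argument of a search call in place of the plain handle.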
import json
import requests
from twython import Twython
#from credentials import *
import pandas as pd
import plotly.express as px
import missingno as msno
%matplotlib inline
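def data_from_twitter(query, **kwargs):
    """Search Twitter for each term in `query` and return all results as one DataFrame."""
    frames = []
    # twitter.json holds the app's consumer_key and consumer_secret
    with open("twitter.json", "rb") as f:
        creds = json.load(f)
    twitter = Twython(creds["consumer_key"], creds["consumer_secret"])
    for q in query:
        try:
            result = twitter.search(q=q, tweet_mode="extended", **kwargs)
            df2 = pd.json_normalize(result["statuses"])
            df2["query"] = q  # remember which search term produced each row
            frames.append(df2)
        except Exception as err:
            # Report the failure instead of hiding it, then move on to the next term
            print(q, "failed:", err)
            continue
    return pd.concat(frames)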
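The dotted column names used later (for example `user.followers_count`) come from `pd.json_normalize`, which flattens nested dictionaries into columns. A minimal sketch, using a made-up and heavily trimmed stand-in for the `statuses` list of a search response:

```python
import pandas as pd

# Hypothetical, trimmed stand-in for the "statuses" list in a search response
statuses = [
    {"id": 1, "retweet_count": 0, "user": {"followers_count": 108, "verified": False}},
    {"id": 2, "retweet_count": 11, "user": {"followers_count": 45, "verified": False}},
]

# Nested keys become "parent.child" column names
flat = pd.json_normalize(statuses)
print(flat.columns.tolist())
```

Real responses carry many more nested fields, which is why the full DataFrame ends up with hundreds of columns.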
To collect the data, I chose the official verified Twitter pages of 5 Bangladeshi cricket players. The accounts queried are @Sah75official, @Mahmudullah30, @mushfiqur15, @TamimOfficial28 and @Mustafiz90.
Then I checked the DataFrame, which has 375 rows and 248 columns in the run shown here. I went through the column information and chose the columns that carry numerical information for further processing. So I created a new DataFrame with the necessary columns, which has 375 rows and 11 columns.
I also changed the data types as required for the analysis.
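For the date columns, the type change amounts to parsing Twitter's timestamp strings. The sketch below does this on two sample values copied from the output further down; passing an explicit `format` is an optional choice assumed here, not something the notebook itself does.

```python
import pandas as pd

# Two created_at values in the format Twitter returns
raw = pd.Series(["Tue Jun 22 11:21:36 +0000 2021", "Fri Jun 18 11:03:43 +0000 2021"])

# An explicit format avoids per-value format guessing; %z keeps the UTC offset
parsed = pd.to_datetime(raw, format="%a %b %d %H:%M:%S %z %Y")
print(parsed.dt.date.tolist())
```

Once parsed, the `.dt` accessor gives access to the date, hour and other components used for the time series plot.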
#col = ["created_at","id", "place", "retweet_count", "possibly_sensitive", "user.location", "user.followers_count",
# "user.created_at", "user.favourites_count", "user.verified", "user.statuses_count",
# "user.withheld_in_countries", "place.country_code", "place.country", "place.bounding_box.coordinates"]
df = data_from_twitter(query=["@Sah75official", "@Mahmudullah30", "@mushfiqur15", "@TamimOfficial28",
                              "@Mustafiz90"], count=100)
col = ["query", "created_at", "id", "possibly_sensitive", "user.location", "user.followers_count",
       "user.created_at", "user.favourites_count", "user.verified", "user.statuses_count", "retweet_count"]
# .copy() so that the column assignments below modify df1 itself, not a view of df
df1 = df.loc[:, col].copy()
df
| | created_at | id | id_str | full_text | truncated | display_text_range | source | in_reply_to_status_id | in_reply_to_status_id_str | in_reply_to_user_id | ... | retweeted_status.place.id | retweeted_status.place.url | retweeted_status.place.place_type | retweeted_status.place.name | retweeted_status.place.full_name | retweeted_status.place.country_code | retweeted_status.place.country | retweeted_status.place.contained_within | retweeted_status.place.bounding_box.type | retweeted_status.place.bounding_box.coordinates |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Tue Jun 22 11:21:36 +0000 2021 | 1407297714488893440 | 1407297714488893440 | @Manik_Sarkar81 @DAILYITTEFAQ @BhabnaAshna @mu... | False | [160, 443] | <a href="http://twitter.com/download/android" ... | 1.407186e+18 | 1407185967220330496 | 8.449953e+17 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | Tue Jun 22 10:51:32 +0000 2021 | 1407290147750637570 | 1407290147750637570 | @Retainlytoken good project and good project a... | False | [0, 115] | <a href="http://twitter.com/download/android" ... | NaN | None | 1.404898e+18 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | Tue Jun 22 09:48:21 +0000 2021 | 1407274247441960963 | 1407274247441960963 | @Nitin7304 This is my playing XI team in all w... | False | [11, 289] | <a href="http://twitter.com/download/android" ... | 1.407190e+18 | 1407190318810767360 | 1.299690e+18 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | Tue Jun 22 08:14:36 +0000 2021 | 1407250656251944962 | 1407250656251944962 | @dyldohvx good project and good project and @B... | False | [10, 110] | <a href="http://twitter.com/download/android" ... | 1.392995e+18 | 1392995129686777857 | 1.384933e+18 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | Tue Jun 22 07:35:43 +0000 2021 | 1407240869346832390 | 1407240869346832390 | @Extracoinorg I really like the project a lot\... | False | [14, 220] | <a href="http://twitter.com/download/android" ... | 1.406859e+18 | 1406858851052179456 | 1.404741e+18 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | Fri Jun 18 18:27:16 +0000 2021 | 1405955289061498884 | 1405955289061498884 | @greatmoonbsc It’s very good project \n \n#air... | False | [14, 157] | <a href="https://mobile.twitter.com" rel="nofo... | 1.405853e+18 | 1405852586700414980 | 1.400047e+18 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 96 | Fri Jun 18 13:36:07 +0000 2021 | 1405882015723888651 | 1405882015723888651 | @BCBtigers @Mustafiz90 Lol | False | [23, 26] | <a href="http://twitter.com/download/android" ... | 1.405759e+18 | 1405759169165086730 | 3.165771e+08 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 97 | Fri Jun 18 12:50:00 +0000 2021 | 1405870409958838276 | 1405870409958838276 | On this day in 2015, @Mustafiz90 played his de... | False | [0, 256] | <a href="http://twitter.com/download/android" ... | NaN | None | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 98 | Fri Jun 18 12:14:43 +0000 2021 | 1405861532219723776 | 1405861532219723776 | RT @BCBtigers: On this day 6 years ago, .@Must... | False | [0, 140] | <a href="http://twitter.com/download/android" ... | NaN | None | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 99 | Fri Jun 18 11:03:43 +0000 2021 | 1405843664362819589 | 1405843664362819589 | @Hpns_official Nice Project \n@ltc_angel @m @M... | False | [15, 54] | <a href="https://mobile.twitter.com" rel="nofo... | 1.405814e+18 | 1405814364880338947 | 1.388242e+18 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
375 rows × 248 columns
df.shape
(375, 248)
df1
| | query | created_at | id | possibly_sensitive | user.location | user.followers_count | user.created_at | user.favourites_count | user.verified | user.statuses_count | retweet_count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | @Sah75official | Tue Jun 22 11:21:36 +0000 2021 | 1407297714488893440 | False | پاکستان | 108 | Wed Apr 14 09:58:21 +0000 2021 | 1 | False | 344 | 0 |
| 1 | @Sah75official | Tue Jun 22 10:51:32 +0000 2021 | 1407290147750637570 | NaN | Bangladesh | 2 | Sat Jun 19 09:37:04 +0000 2021 | 84 | False | 188 | 0 |
| 2 | @Sah75official | Tue Jun 22 09:48:21 +0000 2021 | 1407274247441960963 | NaN | Badvel | 2 | Sun Jun 13 15:59:04 +0000 2021 | 228 | False | 188 | 0 |
| 3 | @Sah75official | Tue Jun 22 08:14:36 +0000 2021 | 1407250656251944962 | NaN | Bangladesh | 2 | Sat Jun 19 09:37:04 +0000 2021 | 84 | False | 188 | 0 |
| 4 | @Sah75official | Tue Jun 22 07:35:43 +0000 2021 | 1407240869346832390 | NaN | | 1 | Thu Jun 17 07:26:31 +0000 2021 | 63 | False | 77 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | @Mustafiz90 | Fri Jun 18 18:27:16 +0000 2021 | 1405955289061498884 | NaN | | 3 | Thu May 20 09:10:52 +0000 2021 | 87 | False | 156 | 0 |
| 96 | @Mustafiz90 | Fri Jun 18 13:36:07 +0000 2021 | 1405882015723888651 | NaN | Khulna, Bangladesh | 5 | Fri Nov 18 03:15:10 +0000 2016 | 529 | False | 72 | 0 |
| 97 | @Mustafiz90 | Fri Jun 18 12:50:00 +0000 2021 | 1405870409958838276 | False | Dhaka, Bangladesh | 297 | Fri Nov 22 10:04:10 +0000 2019 | 168930 | False | 6339 | 0 |
| 98 | @Mustafiz90 | Fri Jun 18 12:14:43 +0000 2021 | 1405861532219723776 | NaN | Barisal, Bangladesh | 45 | Fri Nov 29 04:31:25 +0000 2019 | 13678 | False | 531 | 11 |
| 99 | @Mustafiz90 | Fri Jun 18 11:03:43 +0000 2021 | 1405843664362819589 | NaN | | 2 | Sun Mar 21 14:44:05 +0000 2021 | 38 | False | 90 | 0 |
375 rows × 11 columns
df1["created_at"] = pd.to_datetime(df1["created_at"])
df1["user.created_at"] = pd.to_datetime(df1["user.created_at"])
df1.dtypes
query                                 object
created_at               datetime64[ns, UTC]
id                                     int64
possibly_sensitive                    object
user.location                         object
user.followers_count                   int64
user.created_at          datetime64[ns, UTC]
user.favourites_count                  int64
user.verified                           bool
user.statuses_count                    int64
retweet_count                          int64
dtype: object
df1.isna().sum()
query                      0
created_at                 0
id                         0
possibly_sensitive       310
user.location              0
user.followers_count       0
user.created_at            0
user.favourites_count      0
user.verified              0
user.statuses_count        0
retweet_count              0
dtype: int64
msno.bar(df1);
a = df1[["query","created_at"]]
a
df1["date"]= df1['created_at'].dt.date
#df1.groupby([["query","df1.created_at.dt.hour"]])
#print(df1.date)
#print(df1['created_at'].dt.time)
#print(a['created_at'].dt.strftime('%H:%M'))
fig = px.area(a, x=a['created_at'].dt.date, facet_col="query", facet_col_wrap=3,
              color_discrete_sequence=px.colors.qualitative.Pastel2,
              title='7 days time series analysis with frequency of Tweets of the five players')
fig.show();
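To plot an actual tweet frequency, the tweets first have to be counted per day and per query. The sketch below shows one way to do that with `groupby(...).size()` on a small made-up sample that has the same two columns as `a`; the column name `tweets` is an assumption introduced here.

```python
import pandas as pd

# Small made-up sample with the same two columns the plot uses
sample = pd.DataFrame({
    "query": ["@Mustafiz90", "@Mustafiz90", "@Mustafiz90", "@mushfiqur15"],
    "created_at": pd.to_datetime([
        "2021-06-18 11:03:43+00:00",
        "2021-06-18 12:14:43+00:00",
        "2021-06-19 08:00:00+00:00",
        "2021-06-18 09:30:00+00:00",
    ], utc=True),
})

# One row per (query, date) pair with the number of tweets on that day
daily = (
    sample.assign(date=sample["created_at"].dt.date)
    .groupby(["query", "date"])
    .size()
    .reset_index(name="tweets")
)
print(daily)
```

A frame shaped like `daily` could then be given to `px.area` with `x="date"`, `y="tweets"` and `facet_col="query"`, so that the y-axis shows the daily count.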
# numeric_only=True skips the datetime and text columns, which cannot be summed
df2 = df1.groupby("query").sum(numeric_only=True).reset_index()
#final = df2.rename(columns={"query":"players_name"})
#final
df2
| | query | id | user.followers_count | user.favourites_count | user.verified | user.statuses_count | retweet_count |
|---|---|---|---|---|---|---|---|
| 0 | @Mahmudullah30 | 2.952423e+19 | 495.0 | 43350.0 | 0 | 6763.0 | 0.0 |
| 1 | @Mustafiz90 | 1.406576e+20 | 4486571.0 | 1375959.0 | 2 | 627571.0 | 2170.0 |
| 2 | @Sah75official | 1.406947e+20 | 22350.0 | 175737.0 | 0 | 157112.0 | 46621.0 |
| 3 | @TamimOfficial28 | 7.590975e+19 | 287907.0 | 597684.0 | 2 | 1177486.0 | 72.0 |
| 4 | @mushfiqur15 | 1.406132e+20 | 2812778.0 | 1186940.0 | 1 | 766366.0 | 276.0 |
# count() would count every non-missing value (True and False alike),
# so count only the tweets explicitly flagged as possibly sensitive
sensitive_tweets = df1["possibly_sensitive"].eq(True).groupby(df1["query"]).sum()
s = sensitive_tweets.reset_index()
fig = px.bar(s, x="query", y="possibly_sensitive", color="query",
             color_discrete_sequence=px.colors.qualitative.Pastel1,
             title="Compare based on Players Sensitive Tweets")
fig.show();
user_followers_count = df2[["query","user.followers_count",]]
fig = px.bar(user_followers_count, x = 'query',y="user.followers_count",color="query",
color_discrete_sequence=px.colors.qualitative.Pastel1, title="Compare based on User followers count")
fig.show();
retweet_count = df2[["query","retweet_count"]]
fig = px.bar(retweet_count, x = 'query',y="retweet_count",color="query",
color_discrete_sequence=px.colors.qualitative.Pastel, title="Compare based on Retweet count")
fig.show();
user_statuses_count= df2[["query","user.statuses_count"]]
fig = px.pie(user_statuses_count, values='user.statuses_count', names='query',
color_discrete_sequence=px.colors.qualitative.Pastel2,
title='Comparison based on users statuses count')
fig.show()
user_favourites_count = df2[["query","user.favourites_count"]]
fig = px.scatter(user_favourites_count, x="query", y="user.favourites_count", title='Comparison based on user favorite count')
fig.show();
This is a very limited dataset, and only a limited analysis has been done. The results cover the period from 8 June 2021 to 16 June 2021. The results I got from this analysis are: